Job Opportunities at the Internet Archive

Web Crawl Engineer

Location: Inner Richmond, San Francisco, CA or Remote

Job Classification: Full-time, Exempt

Job Summary: The Internet Archive is seeking a Web Crawl Engineer for its Web Archiving Group. Our crawl engineering team is responsible for capturing and managing the highest quality content from the web. An ideal candidate demonstrates independence and initiative, is a problem solver, works well autonomously, and is technologically savvy. Additionally, the ideal candidate is open to being trained on, and helping advance, best practices and standards around large-scale web harvests, web data processing and engineering, and contributing to the development of new harvesting, access, and analysis tools.

The position will work in the Web Archiving Group in support of web harvesting services and programs, working with partners ranging from national libraries and archives to collaborative international initiatives supporting the collection, preservation, and accessibility of web content. The role will help design the strategy and implementation of web archiving services using open source technologies and platforms, and will develop harvest techniques and tools to enable archival capture and re-rendering of rich media, streaming content, and social media, as well as traditional web page content. The position will also create tools, services, and workflows to improve crawl analysis, reports, data management and derivation, and identify technical, operational, and data analysis requirements. This role contributes to defining deployment architectures and workflows, managing data at scale, and monitoring production systems.

Essential Job Functions:

  • Running large-scale web harvests on global and national domain levels and focused and specialized crawls using Heritrix, our open-source crawler, as well as other open-source technologies developed internally, including Umbra, Brozzler, warcprox and others.
  • Configuration, monitoring, and improvement of large-scale, multi-machine web crawls to ensure their quality and timely completion.
  • Processing, analysis and quality assurance of archived web content to ensure it is complete and of the highest quality.
  • Contribute to development of tools for automated analysis and reporting of crawl material, and to development projects focused on crawling, processing, and access.
  • Manage both large ingests and exports of web data, derivatives, logs, and reports.
  • Demonstrated experience delivering on commitments with deadlines and project timelines, and working in a collaborative team of engineers and project/product managers.

Minimum Qualifications:

  • Experience with web crawlers or scrapers, especially Heritrix
  • Proven experience in Unix shell scripting and Python coding required
  • Solid experience in Internet protocols (HTTP is a must). Strong knowledge of HTML, JavaScript, and web technologies in general
  • Knowledge of building and deploying web applications, databases, web-host services, and knowledge of basic Linux system administration
  • Ability to work in, and enjoy, a loosely structured work environment

Preferred Qualifications:

  • Cluster computing experience is preferred, especially familiarity with Hadoop and related technologies and tools
  • Experience or familiarity with Java strongly preferred
  • Experience with applications designed to display archived web content, especially server-side apps and Wayback
  • Experience with development environments and system monitoring/administration tools
  • Experience with open source practices, version control, and code review
  • Experience with Atlassian tool sets
  • Flexibility and a sense of humor are a plus

Requirements: Bachelor's Degree in Computer Science or a related field, and five years of progressively responsible experience in software development.

Reporting Structure: The Web Crawl Engineer reports to the Director of Web Archiving and works closely with other departments. The position works alongside other web archiving engineers as well as program staff in the Web Archiving Group and with the broader Internet Archive infrastructure and engineering teams.

To Apply: Please send your resume and cover letter to jobs+crawlengineer@archive.org with the subject line "Web Crawl Engineer."

Internet Archive reserves the right to revise job descriptions or work hours as required.

Internet Archive is an Equal Opportunity Employer and a 501(c)(3) nonprofit library founded in 1996.

The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

Web Archiving Software Engineer

Location: Inner Richmond, San Francisco, CA or Remote

Job Classification: Full-time, Exempt

Job Summary: The Internet Archive has over 24PB of unique digital information, all running across an integrated cluster of over 700 VMs on 500+ bare-metal hosts in 3 data centers. We are looking for a smart engineer with experience in defining and building service APIs. The ideal candidate will also have experience creating software that interacts with systems at high transaction rates while delivering reliability and performance of both internal and public-facing web applications. All candidates must be able to work collaboratively within our Web Archiving team of talented engineers and program staff.

Essential Job Functions:

  • Build, test, and package APIs for the transfer of data out of a repository of web archive files
  • Consume external APIs to enable the ingest of external data into web archive files
  • Deploy, administer, and tune tools that support the software development infrastructure and data management and processing environments used within the Web Archiving group
  • Analyze, manage, transfer, and maintain large amounts of archival data in multiple environments
  • Participate in monitoring, maintaining, and restoring the health of the storage and computer cluster and key processes and services related to crawling, indexing, and access to archived web content

Minimum Qualifications:

  • Fluency in Linux environments, scripting and/or programming skills, development of custom tool integrations
  • Proven experience in Unix shell scripting and Python required
  • Demonstrated experience building or working with APIs
  • Experience deploying and administering database, search, and web-host services
  • Proven experience with open source practices, participation in open source forums, and staying current with industry trends
  • BS in Computer Science, or equivalent work experience

Preferred Qualifications:

  • Familiarity with the configuration of software development environments and cluster administration tools, including Git, the ELK stack, and monitoring tools such as Nagios, Graphite, and Grafana
  • Knowledge of evolving database or analytics tools, especially Hadoop, Druid, or RethinkDB
  • Experience or familiarity with Java is a plus
  • Experience with Atlassian tool sets
  • MS in Computer Science or equivalent work experience
  • Flexibility and a sense of humor

Reporting Structure: The Web Archiving Software Engineer reports to the Director of Engineering and works closely with the Director, Web Archiving Programs. The position will also work alongside other systems, applications, and QA engineers as well as program staff in the Web Archiving Programs team.

To Apply: Please send your resume and cover letter to jobs+webarchivingengineer@archive.org with the subject line "Web Archiving Software Engineer."

Internet Archive reserves the right to revise job descriptions or work hours as required.

Internet Archive is an Equal Opportunity Employer and a 501(c)(3) nonprofit library founded in 1996.

The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

Manager: Site Reliability and Infrastructure

Location: Inner Richmond, San Francisco, CA and City of Richmond, CA; ON-SITE PRESENCE IN SF/RICHMOND IS REQUIRED! Remote employment is not available for this position

Job Classification: Full-time, Exempt

Job Summary: The Internet Archive has over 30PB of unique digital information... all running across an integrated cluster of over 700 VMs on over 600 "bare-metal" hosts in 2 data centers. We are looking for a "hands-on" operations manager with proven experience effectively managing and participating in a high-performance team of system administrators and technical operations staff. The ideal candidate will be looking to take on a "player-coach" role and have demonstrated experience improving and maintaining the reliability, performance, and security of both internal and publicly facing web infrastructure, online services, networks, and database systems. They must also be skilled in management communications and able to work collaboratively with our extended team of talented engineers and program staff.

Key Responsibilities:

  • Manage, contribute to, and mentor the technical team responsible for monitoring, maintaining, and restoring the health of all Internet Archive networks and online services. This includes all publicly-facing services, the storage and compute cluster, as well as key internal services related to crawling, indexing, and access to archived web content
  • Maintain and expand monitoring and reporting systems to communicate current and historical activity for multiple publicly facing services and to ensure service continuity and performance.
  • Analyze, implement, and manage effective improvements in the maintenance and operations processes and infrastructure.
  • Plan and coordinate the transition of new software systems and service applications from a development into a production footing. This includes establishing procedures and policy that will ensure sustainable deployment, monitoring, upgrade and expansion of such services.
  • Assign, support, recruit, hire, schedule, and fire staff as needed to sustain operational objectives and efficiency.
  • Recommend the purchase of equipment needed to sustain responsive services and cost-effective operations.

Minimum Qualifications:

  • Experience managing large server cluster infrastructure
  • Experience as lead manager and mentor of a technical operations team
  • Passionate and fierce advocate for the end-user experience of web-delivered services
  • Experience in a highly available 24x7 production environment.
  • Ability to "fire fight" personally and to document and share critical knowledge with others
  • "Customer Service" mentality - working proactively to identify and address user and co-worker challenges.
  • Passion for automation, data-driven decision making, and information reporting
  • Experience with high-bandwidth networking environments
  • Deep technical understanding of virtual hosts, containers, network architecture, DNS, DHCP
  • Work history that includes production-level programming in high-transaction environments.
  • Fluency in Linux system administration, Unix shell scripting, and familiarity with Python, PHP, etc.
  • Experience deploying and administering database, search, and web-host services
  • Experience establishing comprehensive monitoring and log analysis tools and infrastructure.
  • Excellent and creative problem solver. You don’t need to know everything but you need to know how to find the solution.
  • Experience with open source practices and a passion for staying current with industry trends
  • Willingness to travel to network operation centers and participate as necessary in physical equipment installation
  • BS in Computer Science, or equivalent work experience

Preferred Qualifications:

  • Strong experience with Ansible, Git, Nagios, Mesos, Redis, ELK stack, Kubernetes, etc.
  • Experience deploying and maintaining big-data analytics tools, especially Hadoop, Druid
  • Excellent oral/written communication and documentation skills
  • MS in Computer Science or equivalent work experience
  • Flexibility and a sense of humor

Reporting Structure: The Manager of Site Reliability and Infrastructure reports to the Director of Engineering and works closely with the Head Librarian and Founder.

To Apply: Please send your resume and cover letter to TechJobs@archive.org with the subject line "Manager of Site Reliability."

Internet Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

Hardware Design Engineer

Job Summary: The Internet Archive (IA) is currently seeking a Hardware Design Engineer to develop and evolve the highly specialized digitization equipment used in IA book digitization centers, projects, and workflows worldwide. This is a contract position with hours based on project demands, with the possibility of expansion to a full-time salaried position.

Responsibilities:

    The Hardware Design Engineer is responsible for the design, prototyping, manufacture, testing, and installation of IA's book digitization equipment to meet rigorous quality standards and support production expansion at multiple sites. The successful candidate will work on several projects related to all phases of design and manufacturing, including development of hardware requirements documents and specifications; hands-on support for metal fabrication and other prototyping methods as necessary; planning and implementation of tests for prototypes and production equipment; development of engineering sketches and CAD/CAM drawings to determine design factors for equipment based on requirements; oversight of manufacturing processes and vendors; analysis of hardware configurations for conformance to specifications; collaboration with software engineers on systems integration; documentation of equipment usage instructions; onsite equipment installation and testing as necessary; and continuous data-driven improvement of equipment to meet IA's organizational goals.

Requirements:

  • Advanced degree in mechanical engineering
  • Two to three years of mechanical engineering experience, preferably related to digitization equipment
  • Expertise with CAD (Solidworks) for mechanical design
  • Strengths in creativity and thinking outside the box
  • Technical rigor and detail orientation
  • Computer systems and networking knowledge and familiarity with Unix environments
  • Software development skills and experience with Python or other scripting languages preferred
  • Demonstrated ability to communicate effectively both verbally and in writing
  • Strong interpersonal skills
  • Capacity to travel on occasion to IA's digitization centers worldwide
  • Commitment to the greater good and to doing work that has a beneficial long-term impact on society
  • Flexibility and a sense of humor are essential

To Apply: Please send your resume and cover letter to hr@archive.org with the subject line "Hardware Design Manager."

Internet Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

Digital Imaging Specialist

Job Summary: The Internet Archive (IA) is currently seeking a detail-oriented Digital Imaging Specialist with an engineering/photography background to ensure that defined standards for image quality are consistently and reliably achieved in IA book digitization centers, projects, and workflows worldwide. This is a contract position with hours based on project demands, with the possibility of expansion.

Responsibilities:

  • The Digital Imaging Specialist develops, specifies, and evolves imaging quality criteria for book digitization and other digital imaging processes; tests and validates camera equipment and software to meet these criteria; develops and documents calibration and setup instructions for imaging equipment
  • Troubleshoots issues related to image quality of scanned texts
  • Performs active review of target images using Golden Thread software and other analyses
  • Documents equipment calibration settings and digitization workflows to achieve imaging quality criteria
  • Develops and codifies quality assurance standards and procedures
  • Collaborates with digitization managers on imaging-related processes
  • Contributes to digitization hardware design and development
  • Consults with IA staff and library partners on issues of image quality

Requirements: Two to three years of experience working in the field of digital imaging and highly developed technical skills are required. Expertise with a range of digital imaging and digital photography technologies is a must, including experience with digitization setup, digital camera and scanner technology, digitization workflows, imaging system troubleshooting, and image analysis and processing software. Preferred qualifications include:

  • Demonstrated experience with technical photography, lighting principles, device calibration, procedures and setup, preferably related to book digitization
  • Degree in mechanical or electrical engineering
  • Knowledge of image file formats, current and emerging imaging standards
  • Computer systems and networking knowledge and familiarity with Unix environments
  • Scripting skills and experience with Python or other scripting languages would be a plus
  • Demonstrated ability to communicate effectively orally and in writing
  • Strong interpersonal skills
  • Flexibility to work different shifts as necessary to get the job done
  • Commitment to the greater good and to doing work that has a beneficial long-term impact on society
  • Flexibility and a sense of humor are essential

    To Apply: Please send your resume and cover letter via email to hr@archive.org with the subject line "Digital Imaging Specialist."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) nonprofit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

    Head of Digitization

    Location: San Francisco, CA, remote possible.

    Job Classification: Full-time, Exempt

    Senior management position to manage and expand our digitization of millions of books, audio records, films and videotapes to build one of the world's largest digital libraries.

    Reporting to the Digital Librarian, the Head of Digitization will have overall strategic and operational responsibility for Internet Archive's 70+ digitization staff in 8 countries, programs, expansion, and execution of the group's mission.

    This requires managing people, setting up facilities, creating production processes, and working through process improvements.

    Responsibilities

  • Triple production rates and expand the media types that are efficiently digitized
  • Build production processes and manage to them
  • Manage 70+ staff and volunteers working in libraries in multiple countries
  • Manage the contracts and operate a couple of remote "super scanning centers"
  • Work closely with our partner libraries and vendors
  • Identify and resolve bottlenecks within the workflow that hinder quality, partner satisfaction, and efficiency.
  • Develop new and interesting partnerships
  • Track and communicate production throughput and productivity.
  • Project manage hardware/software releases across all scanning operations
  • Develop strong working relationships with engineering, finance, administration, and HR teams

    Qualifications:

  • Engineering mindset and approach to production processes
  • Experience working internationally, setting up and operating factories
  • Track record of effectively leading and scaling a performance-based organization and staff
  • Unrelenting commitment to quality and efficiency
  • Desire to travel
  • Ability to work effectively with diverse groups of employees and library partners
  • Passion, integrity, positive attitude, mission-driven, and self-directed
  • Engineering degree with at least 5 years of senior management experience.

    To Apply: Please send your resume and cover letter to hr@archive.org with the subject line "Head of Digitization."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) nonprofit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

    Book Scanner

    Location: Washington DC area

    Job Classification: Full-time, Non-Exempt

    The Book Scanning Operator ("Scanner") digitizes books and helps debug the scanning process in the Internet Archive scanning centers. The Internet Archive has an immediate opening for a Scanner in the Washington DC area.

    Desired Qualifications:

  • High tolerance for repetitive tasks.
  • Attention to detail.
  • Ability to assess image quality and to tell whether a page has been skipped.
  • Average computer skills.
  • Willingness to do first level of troubleshooting.
  • Ability to communicate with others about problems or solutions.
  • Must be able to sit/stand at a scanning device constantly.
  • Patience and a natural curiosity about how things work are required.
  • Previous imaging experience is not necessary.

    This is a non-exempt hourly position. Benefits include medical, dental, FSA/DCA, 403(b), LTD, and life insurance.

    To Apply: Please send your resume and cover letter to hr@archive.org with the subject line "Scanner DC."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) nonprofit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

    Senior Application Developer: Archive.org

    Location: San Francisco, CA

    Job Classification: Full-time, exempt

    Job Summary: The Internet Archive has a huge corpus of digital information. Every day, our team of development engineers creates tools and applications that help our users access and work with 22 petabytes of content that includes millions of books and texts, millions of hours of video, millions of audio tracks, and over 450 billion web captures. We are looking for smart engineers to help develop the next generation of web-based applications and tools that will be used by libraries and archives around the world to build and manage curated collections of books, texts, web, and image content. The ideal candidate will be a strong programmer who has successfully led and completed several projects involving large or intricate web applications or services, and who works collaboratively with talented engineering colleagues.

    Key Responsibilities:
    The responsibilities of this position are to be part of the team that will maintain and evolve the Archive.org web site. More specifically, this means:
    • Work at the direction of the technical project lead to continue to evolve and enhance the next generation of the archive.org web site.

    Minimum Qualifications:

    • Passion for delivering delightful end-user experiences when interacting with delivered web applications and services.
    • Extensive work experience with JavaScript, HTML5, and CSS.
    • Extensive experience developing applications and websites in PHP
    • Work history that includes integrating front-end user interfaces with search, database, and business logic to create integrated applications and services.
    • Experience working with digital media files and metadata structures
    • Experience developing and maintaining structured APIs
    • Good understanding of latest web framework technologies and protocols
    • Fluency in Linux environments
    • Flexibility and a sense of humor

    Preferred Qualifications:
    • Strong programming experience in Python.
    • Experience with open source practices and participation in open source forums
    • Experience working with time-based digital media (audio and video).
    • Specific experience with Atlassian tool sets (Jira, Confluence)

    Reporting Structure: The Web Application Developer reports to the Director of Engineering and will work closely with the web archiving and TV archiving teams. The entire staff is guided by founder and Digital Librarian, Brewster Kahle.

    To Apply: Please send your resume and cover letter to Jobs+Seniorapplicationdeveloper@archive.org with the subject line "AE-106: Web Application Developer."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) nonprofit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

    Senior Engineer: Wayback Machine

    Location: San Francisco, CA

    Job Classification: Full-time, exempt

    Job Summary: The Internet Archive's Wayback Machine is the world's largest public archive of historical web sites. Have you ever wanted to work with 450 billion things at once? Would you like to serve 1,500 requests per second? How about having your service referred to regularly in news articles and blog posts across the web? You can work on a challenging and popular project and help the world at the same time.

    We are looking for a smart, collaborative, and resourceful engineer to help develop the next version of the Wayback Machine. The ideal candidate will want to work collaboratively with a small internal team and a large, vocal, and active user community, and will demonstrate independence, creativity, initiative, and technological savvy in addition to being a great programmer/architect.

    Minimum Qualifications:

    • 2-3 years of work experience in Python or a similar language
    • Experience working in Linux environments
    • Familiarity with Java (current deployment is written in Java)
    • Good understanding of latest web framework technologies and aspects of web technology and protocols
    • Flexibility and a sense of humor
    • BS in Computer Science, or equivalent work experience

    Preferred Qualifications:

    • Experience with web crawlers and/or applications designed to display archived web content (especially server-side apps)
    • Cluster computing experience
    • Open source practices experience

    To Apply: Please send your resume and cover letter to Jobs+SeniorWaybackEngineer@archive.org with the subject line "Wayback Machine Senior Engineer."

    Internet Archive reserves the right to revise job descriptions or work hours as required.

    Internet Archive is an Equal Opportunity Employer and a 501(c)(3) nonprofit library founded in 1996.

    The Archive will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.